Universal Character Set

The Universal Character Set (UCS), defined by the International Standard ISO/IEC 10646, Information technology — Universal multiple-octet coded character set (UCS) (plus amendments to that standard), is a standard set of characters upon which many character encodings are based. The UCS contains nearly one hundred thousand abstract characters, each identified by an unambiguous name and an integer number called its code point.

Characters (letters, numbers, symbols, ideograms, logograms, etc.) from the many languages, scripts, and traditions of the world are represented in the UCS with unique code points. The inclusiveness of the UCS is continually improving as characters from previously unrepresented writing systems are added.

Since 1991, the Unicode Consortium has worked with ISO to develop The Unicode Standard ("Unicode") and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Version 2.0 of Unicode exactly match those of ISO/IEC 10646-1:1993 with its first seven published amendments. After the publication of Unicode 3.0 in February 2000, corresponding new and updated characters entered the UCS via ISO/IEC 10646-1:2000. In 2003, parts 1 and 2 of ISO/IEC 10646 were combined into a single part, which has since had a number of amendments adding characters to the standard in approximate synchrony with the Unicode standard.

The UCS has over 1.1 million code points available for use, but only the first 65,536 (the Basic Multilingual Plane, or BMP) had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) ruled in 2000 that all computer systems sold in its jurisdiction would have to support GB 18030. This required computer systems intended for sale in the PRC to move beyond the BMP.

The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimize conflicts with other encoding forms.

Encoding forms of the Universal Character Set

ISO 10646 defines several character encoding forms for the Universal Character Set. The simplest, UCS-2, uses a single code value (defined as one or more numbers representing a code point) between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP, as long as the code point represents a character. UCS-2 cannot represent code points outside the BMP.
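As an illustration, the following Python sketch encodes BMP characters as fixed-width two-byte values in the manner of UCS-2 (the helper name ucs2_encode is purely illustrative and not part of either standard):

    # A minimal sketch of UCS-2: every BMP code point becomes exactly one
    # 16-bit (two-byte) value; code points above U+FFFF cannot be encoded.
    import struct

    def ucs2_encode(text, byte_order=">"):  # ">" big-endian, "<" little-endian
        units = []
        for ch in text:
            cp = ord(ch)
            if cp > 0xFFFF:
                raise ValueError(f"U+{cp:06X} lies outside the BMP; UCS-2 cannot encode it")
            units.append(struct.pack(byte_order + "H", cp))
        return b"".join(units)

    print(ucs2_encode("A\u20ac").hex())  # 004120ac: U+0041 and U+20AC, two bytes each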

The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters; UCS-2 disallows the use of code values for these code points, but UTF-16 uses them in pairs, one value from the high-half zone followed by one from the low-half zone, to represent a code point outside the BMP. Unicode also adopted UTF-16, but in Unicode terminology, the high-half zone elements become "high surrogates" and the low-half zone elements become "low surrogates".
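A short Python sketch of the pairing arithmetic, assuming the surrogate ranges defined by Unicode (U+D800 to U+DBFF for high surrogates, U+DC00 to U+DFFF for low surrogates); the helper name to_surrogate_pair is illustrative:

    # Sketch of the UTF-16 surrogate mechanism: a code point above U+FFFF is
    # reduced by 0x10000 and its 20 bits are split across a high surrogate
    # (U+D800-U+DBFF) and a low surrogate (U+DC00-U+DFFF).
    def to_surrogate_pair(cp):
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000
        high = 0xD800 + (v >> 10)    # top 10 bits
        low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits
        return high, low

    high, low = to_surrogate_pair(0x1D11E)            # MUSICAL SYMBOL G CLEF
    print(f"{high:04X} {low:04X}")                    # D834 DD1E
    print("\U0001D11E".encode("utf-16-be").hex())     # d834dd1e, matching the hand calculation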

Another encoding, UCS-4, uses a single code value between 0 and (theoretically) hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future assignments of characters will also take place in that range). UCS-4 allows representation of each value as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes it simple to manipulate, but of course it requires twice as much storage as UCS-2.
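The corresponding Python sketch for UCS-4 simply widens each code point to four bytes; the helper name ucs4_encode is again illustrative:

    # Sketch of UCS-4 / UTF-32: every code point, including those outside
    # the BMP, occupies exactly four bytes (one 32-bit value).
    import struct

    def ucs4_encode(text, byte_order=">"):
        return b"".join(struct.pack(byte_order + "I", ord(ch)) for ch in text)

    print(ucs4_encode("A\U0001D11E").hex())           # 000000410001d11e
    print("A\U0001D11E".encode("utf-32-be").hex())    # the standard codec produces the same bytes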

Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". UCS-16 does not exist; the authors who make this error usually intend to refer to UCS-2 or to UTF-16.

History of ISO 10646

The International Organization for Standardization (ISO) set out to compose the universal character set in 1989, and published the draft of ISO 10646 in 1990. Hugh McGregor Ross was one of its principal architects. That standard differed markedly from the current one. It defined 128 groups of 256 planes, each plane containing 256 rows of 256 cells, for an apparent total of 2,147,483,648 characters; in practice, however, the standard could code only 679,477,248 characters, as the policy forbade byte values of control characters (0x00 to 0x1F and 0x80 to 0x9F, in hexadecimal notation) in any one of the four bytes. The Latin capital letter A, for example, had a location in group 0x20, plane 0x20, row 0x20, cell 0x41.
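The two totals can be checked with a line of arithmetic: the group octet runs over 128 values, of which the 32 control values are forbidden, leaving 96, and each of the other three octets has 192 permitted values (256 minus the 64 forbidden ones).

    # Worked check of the 1990 draft's code space: 128 x 256 x 256 x 256
    # apparent positions, reduced by forbidding the 64 control-byte values
    # (0x00-0x1F and 0x80-0x9F) in every octet.
    apparent = 128 * 256 * 256 * 256
    usable = 96 * 192 * 192 * 192    # 96 allowed group values, 192 allowed values per remaining octet
    print(apparent, usable)          # 2147483648 679477248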

One could code the characters of this primordial ISO 10646 standard in one of three ways:

  1. UCS-4, four bytes for every character, enabling the simple encoding of all characters;
  2. UCS-2, two bytes for every character, enabling straightforward encoding of the first plane, 0x20 (the Basic Multilingual Plane, containing the first 36,864 code points), with other planes and groups reached by switching to them with ISO 2022 escape sequences;
  3. UTF-1, which encodes all the characters in sequences of bytes of varying length (1 to 5 bytes, each of which contains no control characters).

In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO 10646. The software companies refused to accept the complexity and size requirements of the ISO standard and were able to convince a number of ISO National Bodies to vote against it. The ISO standardisers realised they could not continue to support the standard in its current state and negotiated the unification of their standard with Unicode. Two changes took place: the lifting of the restriction on characters (the prohibition of control-character byte values), thus permitting code points like 0x0000101F; and the synchronisation of the repertoire of the Basic Multilingual Plane with that of Unicode.

Meanwhile, the situation changed in the Unicode standard itself: 65,536 characters came to appear insufficient, and from version 2.0 onwards the standard supports encoding of 1,112,064 characters by means of the UTF-16 surrogate mechanism. For that reason, ISO 10646 was limited to contain as many characters as could be encoded by UTF-16 and no more, that is, a little over a million characters instead of over 2,000 million. The UCS-4 encoding of ISO 10646 was incorporated into the Unicode standard with the limitation to the UTF-16 range and under the name UTF-32. As for UTF-1, no one used it, because of its bad design (no way of distinguishing between single bytes, lead bytes and trail bytes, a problem similar to that of the Shift-JIS encoding of Japanese) and its poor performance (many division operations). Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised a new, fast and well-designed mixed-width encoding, which came to be called UTF-8.[1]
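A brief Python sketch shows where the 1,112,064 figure comes from and illustrates the self-synchronising property that UTF-1 lacked: in UTF-8, lead bytes and continuation bytes occupy disjoint ranges, so a decoder can always tell them apart.

    # 17 planes are reachable through UTF-16, minus the 2,048 code points
    # reserved as surrogates.
    print(17 * 65536 - (0xE000 - 0xD800))   # 1112064

    # In UTF-8, continuation bytes always fall in 0x80-0xBF, so they can
    # never be mistaken for lead bytes (unlike UTF-1 or Shift-JIS trail bytes).
    for b in "\u20ac".encode("utf-8"):      # U+20AC EURO SIGN -> E2 82 AC
        kind = "continuation" if 0x80 <= b <= 0xBF else "lead"
        print(f"{b:02X} {kind}")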

Differences between ISO 10646 and Unicode

ISO 10646 and Unicode have an identical repertoire and numbering: the same characters with the same numbers exist in both standards, although Unicode releases new versions and adds new characters more often. The difference between them is that Unicode adds rules and specifications that are outside the scope of ISO 10646. ISO 10646 is a simple character map, an extension of previous standards such as ISO 8859. In contrast, Unicode adds rules for collation, normalization, and the bidirectional algorithm for right-to-left scripts such as Hebrew and Arabic. For interoperability between platforms, especially when bidirectional scripts are used, it is not enough to support ISO 10646; Unicode must be implemented.

To support these rules and algorithms, Unicode adds many properties to each character in the set such as properties determining a character’s default bidirectional class and properties to determine how the character combines with other characters. If the character represents a numeric value such as the European number ‘8’, or the vulgar fraction ‘¼’, that numeric value is also added as a property of the character. Unicode intends these properties to support interoperable text handling with a mixture of languages.
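These properties can be inspected, for example, with Python's unicodedata module, which exposes data from the Unicode Character Database:

    # Per-character properties: name, default bidirectional class and,
    # where present, numeric value.
    import unicodedata

    for ch in ["8", "\u00BC", "\u05D0"]:   # DIGIT EIGHT, VULGAR FRACTION ONE QUARTER, HEBREW LETTER ALEF
        print(f"U+{ord(ch):04X}",
              unicodedata.name(ch),
              "bidi=" + unicodedata.bidirectional(ch),
              "numeric=" + str(unicodedata.numeric(ch, None)))
    # bidirectional classes printed: EN (European Number), ON (Other Neutral), R (Right-to-Left)
    # numeric values printed: 8.0, 0.25, None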

Some applications support ISO 10646 characters but do not fully support Unicode. One such application, Xterm, can properly display all ISO 10646 characters that have a one-to-one character-to-glyph mapping and a single directionality. It can handle some combining marks by simple overstriking methods, but cannot display Hebrew (bidirectional), Devanagari (one character to many glyphs) or Arabic (both features). Most GUI applications use standard OS text drawing routines which handle such scripts, although the applications themselves still do not always handle them correctly.

Citing the Universal Character Set

ISO 10646 is a general, informal citation for the ISO/IEC 10646 family of standards and is acceptable in most prose. Even though it is a separate standard, the term Unicode is used just as often, informally, when discussing the UCS. However, any normative references to the UCS as a publication should cite a particular part and version, using the form ISO/IEC 10646-{part}:{year}; for example: ISO/IEC 10646-1:1993.

Correlation to Unicode

See §C.1 of The Unicode Standard and http://www.unicode.org/versions/Unicode6.0.0/ for more detail.
